The Boston Housing Dataset contains 506 census tracts from the 1970 Boston census with 19 variables including housing characteristics, environmental factors, and geographic coordinates. Our objective is to predict corrected median home values (cmedv) using dimensionality reduction (PCA, Factor Analysis) and regression with resampling methods. The spatial distribution shows dense clustering in Boston’s urban core with visible coastline, confirming geographic accuracy.

# Importing libraries
library(mlbench)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tibble)
library(knitr)
library(ggcorrplot)
library(mapview)
library(sf)
## Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
library(GGally)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Information about the dataset

The Boston Housing Dataset is well suited to dimensionality-reduction techniques like factor analysis and PCA because it has many quantitative variables, is well structured, and is easy to visualize. It also has many practical applications, and we liked that it was a strong dataset for drawing real-world interpretations.

The following are the columns of the dataset:

  - town: name of the town
  - tract: census tract number
  - lon / lat: longitude and latitude of the census tract
  - medv: median value of owner-occupied homes in $1000s (original)
  - cmedv: corrected median value of owner-occupied homes in $1000s
  - crim: per-capita crime rate by town
  - zn: proportion of residential land zoned for lots over 25,000 sq. ft.
  - indus: proportion of non-retail business acres per town
  - chas: Charles River dummy variable (1 if the tract bounds the river)
  - nox: nitric oxides concentration (parts per 10 million)
  - rm: average number of rooms per dwelling
  - age: proportion of owner-occupied units built prior to 1940
  - dis: weighted distances to five Boston employment centers
  - rad: index of accessibility to radial highways
  - tax: full-value property-tax rate per $10,000
  - ptratio: pupil-teacher ratio by town
  - b: 1000(B − 0.63)², where B is the proportion of Black residents by town
  - lstat: percentage of lower-status population

Loading data

Here we use a dataset that was corrected with Google’s Geocoder, as the original latitude/longitude values were inaccurate and some houses were mapped into the ocean.

However, even the corrected set contains some latitude/longitude outliers.

# Loading fixed dataset
BostonHousing2 <- read.csv("boston_fixed.csv")

# Making all column names lower case
names(BostonHousing2) <- tolower(names(BostonHousing2))

BostonHousing2$lat <- as.numeric(BostonHousing2$lat)
BostonHousing2$lon <- as.numeric(BostonHousing2$lon)

summary(BostonHousing2[, c("lat", "lon")])
##       lat             lon         
##  Min.   :26.09   Min.   :-122.72  
##  1st Qu.:42.27   1st Qu.: -71.23  
##  Median :42.35   Median : -71.10  
##  Mean   :42.10   Mean   : -74.46  
##  3rd Qu.:42.41   3rd Qu.: -71.04  
##  Max.   :48.48   Max.   : -69.11
# Boxplots of numeric columns
numeric_cols <- sapply(BostonHousing2, is.numeric)
boxplot(BostonHousing2[, numeric_cols],
        main = "Boxplots of Numeric Variables (Boston Fixed)",
        col = "lightblue", las = 2)

# Map visualization
mapview(
  BostonHousing2,
  xcol = "lon",
  ycol = "lat",
  crs = 4326,
  grid = TRUE,
  layer.name = "Boston Housing (Fixed)"
)

Data preprocessing

Below we are going to do some preprocessing on the data including dealing with outliers, checking if there are any NA values, and other feature engineering.

Outliers

Here, we identify outliers (extreme values) in the Boston Housing dataset. As seen above, some of the latitude/longitude values are severely off.

Outliers: Removing outlier latitude/longitude data values

# We know what the expected coordinate range of living in Boston is
lat_min <- 42.0; lat_max <- 43.0
lon_min <- -71.5; lon_max <- -70.5

# --- Flag outliers ---
BostonHousing2$outlier <- with(BostonHousing2,
  lat < lat_min | lat > lat_max | lon < lon_min | lon > lon_max
)

cat("⚠️  Detected", sum(BostonHousing2$outlier), "coordinate outliers\n")
## ⚠️  Detected 77 coordinate outliers
# --- Remove outliers ---
BostonHousing2_clean <- subset(BostonHousing2, outlier == FALSE)
cat("✅ Remaining rows after cleanup:", nrow(BostonHousing2_clean), "\n")
## ✅ Remaining rows after cleanup: 429

# More boxplots (post-cleaning)
numeric_cols_clean <- sapply(BostonHousing2_clean, is.numeric)
boxplot(BostonHousing2_clean[, numeric_cols_clean],
        main = "Boxplots (Cleaned BostonHousing2)",
        col = "lightgreen", las = 2)

# --- Updated map ---
mapview(
  BostonHousing2_clean,
  xcol = "lon",
  ycol = "lat",
  crs = 4326,
  grid = TRUE,
  layer.name = "Boston Housing – Cleaned"
)

Outliers: Checking if any outliers in cmedv (median value of owner-occupied homes)

# Load libraries
library(MASS)      # Note: masks dplyr::select(); data itself was loaded from CSV above
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(ggplot2)
library(tidyr)     # For pivot_longer()

# -----------------------------
# 1. Boxplot for cmedv
# -----------------------------
ggplot(BostonHousing2, aes(y = cmedv)) +
  geom_boxplot(fill = "lightblue", color = "blue",
               outlier.color = "red", outlier.shape = 16) +
  theme_minimal() +
  labs(title = "Boxplot of Median Home Value (cmedv)",
       y = "Median Value ($1000s)")

# -----------------------------
# 2. Outlier detection - 3 sigma rule
# -----------------------------
mu <- mean(BostonHousing2$cmedv, na.rm = TRUE)
sigma <- sd(BostonHousing2$cmedv, na.rm = TRUE)

n_outliers_sigma <- sum(BostonHousing2$cmedv < mu - 3*sigma |
                        BostonHousing2$cmedv > mu + 3*sigma, na.rm = TRUE)
cat("Outliers detected by 3-sigma rule:", n_outliers_sigma, "\n")
## Outliers detected by 3-sigma rule: 0
# -----------------------------
# 3. Outlier detection - IQR method
# -----------------------------
QI <- quantile(BostonHousing2$cmedv, 0.25, na.rm = TRUE)
QS <- quantile(BostonHousing2$cmedv, 0.75, na.rm = TRUE)
IQR_val <- QS - QI

n_outliers_iqr <- sum(BostonHousing2$cmedv < QI - 1.5*IQR_val |
                      BostonHousing2$cmedv > QS + 1.5*IQR_val, na.rm = TRUE)
cat("Outliers detected by IQR method:", n_outliers_iqr, "\n")
## Outliers detected by IQR method: 39
# -----------------------------
# 4. Boxplots for all numeric variables
# -----------------------------
# Select numeric columns manually using base R
numeric_vars <- BostonHousing2[, sapply(BostonHousing2, is.numeric)]

# Pivot and plot
numeric_vars %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = variable, y = value, fill = variable)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 16) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Boxplots of Numeric Variables in BostonHousing2",
       x = "Variable",
       y = "Value")

Even though the IQR method flags 39 observations, inspecting those specific outliers leads us to believe they are not a product of bad or corrupt data. These houses are simply further from city centers, in more distant neighborhoods, which is why their prices differ so drastically.
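Inspecting the flagged rows (rather than dropping them blindly) is easy with a small reusable helper. The sketch below is illustrated on the built-in `mtcars` data as a stand-in, since it only needs a numeric vector; `flag_iqr()` is a hypothetical helper name, not part of the analysis above.

```r
# Hedged sketch: generic 1.5 * IQR fence, same rule as used above
flag_iqr <- function(x, k = 1.5) {
  qs  <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- qs[2] - qs[1]
  x < qs[1] - k * iqr | x > qs[2] + k * iqr
}

# Look at the flagged rows before deciding whether they are bad data
flagged <- flag_iqr(mtcars$hp)
mtcars[flagged, c("hp", "mpg")]
sum(flagged)  # number of observations outside the fences
```

Applied to `cmedv`, the same helper would return the 39 tracts counted above for manual review.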

NA/Missing Values

barplot(colMeans(is.na(BostonHousing2)), las=2)

The barplot indicates that there are no missing values across any variables since all columns have a missing-value proportion of zero. This indicates that there is no need to remove or impute any records in this regard.
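The visual check can also be confirmed numerically. A quick sketch, shown here on a toy data frame with a deliberate NA for contrast (any data frame works the same way):

```r
# Count missing values per column; all zeros means nothing to impute
df <- data.frame(a = 1:3, b = c(2.5, NA, 4))  # toy frame with one NA
colSums(is.na(df))   # per-column NA counts (here: a = 0, b = 1)
any(is.na(df))       # TRUE for this toy frame; FALSE for the Boston data above
```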

Feature Engineering

Some possibilities for feature engineering that we can do on this dataset include:

Combining features to capture more meaningful relationships

Here we combine features such as rooms per dwelling (rm) and distance to Boston employment centers (dis) into a measure of residential space per unit of distance from the city center.

Another engineered feature is tax per room (tax / rm), which measures the tax burden per unit of living space.

BostonHousing2$rooms_per_household <- BostonHousing2$rm / BostonHousing2$dis
BostonHousing2$tax_per_room <- BostonHousing2$tax / BostonHousing2$rm

Scaling and normalization of numeric variables

# Center and scale all numeric columns
numeric_cols <- sapply(BostonHousing2, is.numeric)
BostonHousing2_scaled <- BostonHousing2
BostonHousing2_scaled[, numeric_cols] <- scale(BostonHousing2[, numeric_cols])

Plots to visualize new engineered features

# Summary statistics for the new engineered features
summary(BostonHousing2[, c("rooms_per_household", "tax_per_room")])
##  rooms_per_household  tax_per_room   
##  Min.   :0.535       Min.   : 24.65  
##  1st Qu.:1.244       1st Qu.: 43.57  
##  Median :2.030       Median : 53.59  
##  Mean   :2.162       Mean   : 66.74  
##  3rd Qu.:2.935       3rd Qu.: 97.92  
##  Max.   :5.835       Max.   :187.03
# Correlation of new features with median home value (medv)
cor(BostonHousing2[, c("rooms_per_household", "tax_per_room", "medv")], use = "complete.obs")
##                     rooms_per_household tax_per_room       medv
## rooms_per_household           1.0000000    0.5507516 -0.1548300
## tax_per_room                  0.5507516    1.0000000 -0.5376497
## medv                         -0.1548300   -0.5376497  1.0000000
# Quick scatterplots to visualize relationships
library(ggplot2)
# Rooms per household vs. MEDV
ggplot(BostonHousing2, aes(x = rooms_per_household, y = medv)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "darkred") +
  labs(title = "Rooms per Household vs Median Home Value",
       x = "Rooms per Household (rm/dis)",
       y = "Median Home Value (MEDV)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Tax per room vs. MEDV
ggplot(BostonHousing2, aes(x = tax_per_room, y = medv)) +
  geom_point(alpha = 0.6, color = "seagreen") +
  geom_smooth(method = "lm", se = FALSE, color = "darkred") +
  labs(title = "Tax per Room vs Median Home Value",
       x = "Tax per Room (tax/rm)",
       y = "Median Home Value (MEDV)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Regression models and resampling tools

The code below compares the predictive performance of multiple regression models on the BostonHousing2 dataset. It repeats 1,000 simulations, each time randomly splitting the data into 80% training and 20% test sets.

The following linear regression models are fit on each training set:

  1. cmedv ~ rm — median value vs. average number of rooms

  2. cmedv ~ ptratio — median value vs. pupil–teacher ratio

  3. cmedv ~ lstat — median value vs. % lower-status population

  4. cmedv ~ rm + ptratio + lstat + tax + indus — multivariable model

  5. cmedv ~ all other variables — multivariable model

MSE comparison:

Five models were compared using 1,000 simulations with 80/20 train-test splits. The full model using all predictors achieved the lowest average MSE, and among the smaller models, Model 4 (rm + ptratio + lstat + tax + indus) clearly beat every single-predictor model, demonstrating that physical characteristics (rm), educational quality (ptratio), and socioeconomic status (lstat) capture complementary predictive dimensions.

library(MASS)
set.seed(123)

# -----------------------------
# 1. Data prep
# -----------------------------
# Drop identifiers, factors, and non-predictive columns
x <- BostonHousing2[, !(names(BostonHousing2) %in%
  c("townno", "tract", "medv", "chas", "rad", "outlier", "town"))]

n <- nrow(x)

# Initialize MSE totals
MSE.total <- numeric(5)
names(MSE.total) <- paste0("Model_", 1:5)

# -----------------------------
# 2. Simulation loop
# -----------------------------
for (i in 1:1000) {
  ind.train <- sample(1:n, round(0.8 * n), replace = FALSE)
  train <- x[ind.train, ]
  test  <- x[-ind.train, ]
  
  fit_1 <- lm(cmedv ~ rm, data = train)
  fit_2 <- lm(cmedv ~ ptratio, data = train)
  fit_3 <- lm(cmedv ~ lstat, data = train)
  fit_4 <- lm(cmedv ~ rm + ptratio + lstat + tax + indus, data = train)
  fit_5 <- lm(cmedv ~ ., data = train)
  
  pred1 <- predict(fit_1, newdata = test)
  pred2 <- predict(fit_2, newdata = test)
  pred3 <- predict(fit_3, newdata = test)
  pred4 <- predict(fit_4, newdata = test)
  pred5 <- predict(fit_5, newdata = test)
  
  MSE.total[1] <- MSE.total[1] + mean((pred1 - test$cmedv)^2)
  MSE.total[2] <- MSE.total[2] + mean((pred2 - test$cmedv)^2)
  MSE.total[3] <- MSE.total[3] + mean((pred3 - test$cmedv)^2)
  MSE.total[4] <- MSE.total[4] + mean((pred4 - test$cmedv)^2)
  MSE.total[5] <- MSE.total[5] + mean((pred5 - test$cmedv)^2)
}

# -----------------------------
# 3. Average MSEs
# -----------------------------
MSE.avg <- MSE.total / 1000
print(round(MSE.avg, 3))
## Model_1 Model_2 Model_3 Model_4 Model_5 
##  44.348  63.675  38.404  28.109  18.146
barplot(MSE.avg,
        names.arg = c("rm", "ptratio", "lstat",
                      "rm+ptratio+lstat+tax+indus", "all variables"),
        col = "steelblue",
        main = "Average Test MSE Across Models (1000 Simulations)",
        ylab = "Mean Squared Error",
        las = 2)

Cross validation (leave one out):

The code for both of these appears in the Resampling Bias (session 3) directed study. There, leave-one-out cross-validation compared polynomial degrees using rm: intercept-only, linear (degree 1), quadratic (degree 2), and high-degree (degree 10). The quadratic model achieved the best MSE, balancing flexibility and generalization, while the degree-10 polynomial showed overfitting, with higher CV error despite an excellent training fit, illustrating the bias-variance tradeoff. Below, we apply LOOCV to the same five models from the previous section.
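Since the polynomial comparison itself is not shown here, a minimal LOOCV sketch of the same idea follows; it uses the built-in `cars` data (dist predicted from speed) as a stand-in for cmedv ~ rm, with degrees 0 (intercept-only), 1, 2, and 10:

```r
# Hedged sketch: leave-one-out CV across polynomial degrees
degrees <- c(0, 1, 2, 10)
n <- nrow(cars)

cv_mse <- sapply(degrees, function(d) {
  errs <- sapply(1:n, function(i) {
    train <- cars[-i, ]
    test  <- cars[i, , drop = FALSE]
    # Degree 0 means an intercept-only model
    fit <- if (d == 0) lm(dist ~ 1, data = train)
           else        lm(dist ~ poly(speed, d), data = train)
    (predict(fit, newdata = test) - test$dist)^2
  })
  mean(errs)  # LOOCV MSE for this degree
})

names(cv_mse) <- paste0("degree_", degrees)
round(cv_mse, 2)
```

On this stand-in data, the intercept-only model does worst and the high-degree fit loses to the low-degree fits out of sample, mirroring the bias-variance pattern described above.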

library(MASS)
set.seed(123)

# Prepping data
x <- BostonHousing2[, !(names(BostonHousing2) %in%
  c("townno", "tract", "medv", "chas", "rad", "outlier", "town"))]

n <- nrow(x)

# Initialize MSE accumulators
MSE.total <- numeric(5)
names(MSE.total) <- paste0("Model_", 1:5)

# Leave-One-Out Cross-Validation
for (i in 1:n) {
  # Train/test split
  train <- x[-i, ]
  test  <- x[i, , drop = FALSE]
  
  # Fit models on training data
  fit_1 <- lm(cmedv ~ rm, data = train)
  fit_2 <- lm(cmedv ~ ptratio, data = train)
  fit_3 <- lm(cmedv ~ lstat, data = train)
  fit_4 <- lm(cmedv ~ rm + ptratio + lstat + tax + indus, data = train)
  fit_5 <- lm(cmedv ~ ., data = train)
  
  # Predict the left-out observation
  pred1 <- predict(fit_1, newdata = test)
  pred2 <- predict(fit_2, newdata = test)
  pred3 <- predict(fit_3, newdata = test)
  pred4 <- predict(fit_4, newdata = test)
  pred5 <- predict(fit_5, newdata = test)
  
  # Add squared errors
  MSE.total[1] <- MSE.total[1] + (pred1 - test$cmedv)^2
  MSE.total[2] <- MSE.total[2] + (pred2 - test$cmedv)^2
  MSE.total[3] <- MSE.total[3] + (pred3 - test$cmedv)^2
  MSE.total[4] <- MSE.total[4] + (pred4 - test$cmedv)^2
  MSE.total[5] <- MSE.total[5] + (pred5 - test$cmedv)^2
}

# Avg MSEs
MSE.avg <- MSE.total / n
print(round(MSE.avg, 3))
## Model_1 Model_2 Model_3 Model_4 Model_5 
##  43.963  63.224  38.368  27.783  17.827
# Visualizing
barplot(MSE.avg,
        names.arg = c("rm", "ptratio", "lstat",
                      "rm+ptratio+lstat+tax+indus", "all variables"),
        col = "steelblue",
        main = "Leave-One-Out Cross-Validation (LOOCV) MSE",
        ylab = "Mean Squared Error",
        las = 2)

K-fold cross validation:

This code performs 10-fold cross-validation to compare the predictive performance of five linear regression models predicting median home values (cmedv) in the Boston Housing dataset. The data is randomly partitioned into 10 folds, with each fold serving as a test set while the remaining data trains the models, and the average Mean Squared Error (MSE) across all folds is calculated for each model. The models range from simple single-predictor regressions (rm, ptratio, lstat) to more complex models with multiple predictors, allowing us to evaluate which combination of features best predicts housing values.

library(MASS)
set.seed(123)

# Prepping data
x <- BostonHousing2[, !(names(BostonHousing2) %in%
  c("townno", "tract", "medv", "chas", "rad", "outlier", "town"))]

n <- nrow(x)
Kfolds <- 10
fold_id <- sample(rep(1:Kfolds, length.out = n))  # random fold assignments

MSE.total <- numeric(5)
names(MSE.total) <- paste0("Model_", 1:5)

# K-Fold Cross-Validation
for (k in 1:Kfolds) {
  # Split data
  train <- x[fold_id != k, ]
  test  <- x[fold_id == k, ]
  
  # Fit models
  fit_1 <- lm(cmedv ~ rm, data = train)
  fit_2 <- lm(cmedv ~ ptratio, data = train)
  fit_3 <- lm(cmedv ~ lstat, data = train)
  fit_4 <- lm(cmedv ~ rm + ptratio + lstat + tax + indus, data = train)
  fit_5 <- lm(cmedv ~ ., data = train)
  
  # Predictions
  pred1 <- predict(fit_1, newdata = test)
  pred2 <- predict(fit_2, newdata = test)
  pred3 <- predict(fit_3, newdata = test)
  pred4 <- predict(fit_4, newdata = test)
  pred5 <- predict(fit_5, newdata = test)
  
  # Compute MSEs
  MSE.total[1] <- MSE.total[1] + mean((pred1 - test$cmedv)^2)
  MSE.total[2] <- MSE.total[2] + mean((pred2 - test$cmedv)^2)
  MSE.total[3] <- MSE.total[3] + mean((pred3 - test$cmedv)^2)
  MSE.total[4] <- MSE.total[4] + mean((pred4 - test$cmedv)^2)
  MSE.total[5] <- MSE.total[5] + mean((pred5 - test$cmedv)^2)
}

MSE.avg <- MSE.total / Kfolds
print(round(MSE.avg, 3))
## Model_1 Model_2 Model_3 Model_4 Model_5 
##  44.480  63.320  38.537  28.048  17.962
# Visualization
barplot(MSE.avg,
        names.arg = c("rm", "ptratio", "lstat",
                      "rm+ptratio+lstat+tax+indus", "all variables"),
        col = "steelblue",
        main = paste(Kfolds, "-Fold Cross-Validation MSE", sep = ""),
        ylab = "Mean Squared Error",
        las = 2)

PCA

We applied Principal Component Analysis (PCA) to the continuous variables remaining after dropping identifier and location columns, standardizing the data so all features contribute equally. As the output below shows, the response variables (medv, cmedv), the outlier flag, and the engineered features were retained in this matrix, so the leading components partly reflect the response itself. We examined the variance explained by each component and the variable contributions to the first three principal components to interpret which housing characteristics drive the primary axes of variation.

Data prep and scaling

The selected variables are standardized (mean = 0, SD = 1) so that features with larger scales do not dominate the analysis.

# Remove identifier and location columns (note: medv/cmedv remain in the matrix)
pca_data <- BostonHousing2[, -c(1, 2, 3, 4, 5, 10)]

# Standardize the data
new_data <- scale(pca_data)

# Quick visualization of scaled variable distributions
boxplot(new_data, main = "Boxplots of Standardized Variables in BostonHousing2")

PCA computation and variance summary

PCA is computed on the standardized data, and we examine how much variance each principal component explains through summary statistics and a scree plot to determine how many components effectively capture the dataset’s variability.

pca <- prcomp(new_data, scale = TRUE)
summary_pca <- summary(pca)

# Display variance explained
summary_pca
## Importance of components:
##                           PC1    PC2     PC3     PC4    PC5     PC6     PC7
## Standard deviation     2.8512 1.5309 1.19569 1.06220 0.9410 0.91834 0.84870
## Proportion of Variance 0.4516 0.1302 0.07943 0.06268 0.0492 0.04685 0.04002
## Cumulative Proportion  0.4516 0.5818 0.66127 0.72395 0.7732 0.82000 0.86001
##                            PC8     PC9   PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.76102 0.73466 0.5985 0.53588 0.47216 0.44548 0.38906
## Proportion of Variance 0.03217 0.02998 0.0199 0.01595 0.01239 0.01102 0.00841
## Cumulative Proportion  0.89219 0.92217 0.9421 0.95803 0.97041 0.98144 0.98985
##                           PC15    PC16    PC17    PC18
## Standard deviation     0.32698 0.25757 0.08954 0.03815
## Proportion of Variance 0.00594 0.00369 0.00045 0.00008
## Cumulative Proportion  0.99579 0.99947 0.99992 1.00000
# Mean variance explained across all PCs
mean(summary_pca$importance[2, ])
## [1] 0.05555611
# Scree plot of variance explained
fviz_eig(pca, addlabels = TRUE, barfill = "steelblue", barcolor = "black",
          main = "Scree Plot: Variance Explained by Principal Components")
## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.

Correlation structure among variables

This geographic visualization provides spatial context for the data points, helping us understand whether geographic clustering might influence the underlying correlation structure among housing variables.

# Optional: geographic context visualization
plot(BostonHousing2$lon, BostonHousing2$lat,
     main = "Geographic Distribution of Data Points",
     xlab = "Longitude", ylab = "Latitude", col = "steelblue", pch = 19)

Variable contributions to principal components

We visualize which original variables contribute most strongly to the first three principal components (PC1, PC2, PC3), revealing which housing characteristics drive the primary axes of variation in the dataset.

# Contributions for the first 3 principal components
fviz_contrib(pca, choice = "var", axes = 1)

barplot(pca$rotation[,1], las = 2, main = "Variable Contributions to PC1")

fviz_contrib(pca, choice = "var", axes = 2)

barplot(pca$rotation[,2], las = 2, main = "Variable Contributions to PC2")

fviz_contrib(pca, choice = "var", axes = 3)

barplot(pca$rotation[,3], las = 2, main = "Variable Contributions to PC3")

Identify top contributing variables per component

For each of the first three principal components, we identify and display the five variables with the largest absolute loadings, highlighting which features most strongly define each component’s interpretation.

top_PC1 <- sort(abs(pca$rotation[,1]), decreasing = TRUE)[1:5]
top_PC2 <- sort(abs(pca$rotation[,2]), decreasing = TRUE)[1:5]
top_PC3 <- sort(abs(pca$rotation[,3]), decreasing = TRUE)[1:5]

cat("Top 5 variables for PC1:\n")
## Top 5 variables for PC1:
print(names(top_PC1))
## [1] "tax_per_room" "tax"          "indus"        "nox"          "lstat"
cat("\nTop 5 variables for PC2:\n")
## 
## Top 5 variables for PC2:
print(names(top_PC2))
## [1] "medv"                "cmedv"               "rm"                 
## [4] "rooms_per_household" "dis"
cat("\nTop 5 variables for PC3:\n")
## 
## Top 5 variables for PC3:
print(names(top_PC3))
## [1] "lat"  "rad"  "crim" "tax"  "chas"

Visualize top absolute loadings

Bar plots of absolute loadings show the relative importance of each variable within PC1, PC2, and PC3, making it easier to interpret what each principal component represents in terms of the original housing characteristics.

barplot(sort(abs(pca$rotation[,1]), decreasing = TRUE),
        main = "Absolute Loadings - PC1", las = 2, col = "steelblue")

barplot(sort(abs(pca$rotation[,2]), decreasing = TRUE),
        main = "Absolute Loadings - PC2", las = 2, col = "darkgreen")

barplot(sort(abs(pca$rotation[,3]), decreasing = TRUE),
        main = "Absolute Loadings - PC3", las = 2, col = "purple")

Factor Analysis

Here, we are doing factor analysis with a couple different models:

3-factor model (no rotation)

x.f <- factanal(new_data, factors = 3, rotation = "none", scores = "regression")
x.f
## 
## Call:
## factanal(x = new_data, factors = 3, scores = "regression", rotation = "none")
## 
## Uniquenesses:
##                 lat                medv               cmedv                crim 
##               0.893               0.005               0.005               0.626 
##               indus                chas                 nox                  rm 
##               0.279               0.934               0.215               0.511 
##                 age                 dis                 rad                 tax 
##               0.312               0.154               0.165               0.005 
##             ptratio                   b               lstat             outlier 
##               0.670               0.777               0.316               0.893 
## rooms_per_household        tax_per_room 
##               0.108               0.069 
## 
## Loadings:
##                     Factor1 Factor2 Factor3
## lat                  0.247  -0.114   0.182 
## medv                -0.952   0.300         
## cmedv               -0.953   0.297         
## crim                 0.510   0.327         
## indus                0.637   0.410   0.384 
## chas                -0.148   0.113   0.175 
## nox                  0.577   0.408   0.534 
## rm                  -0.657   0.238         
## age                  0.484   0.278   0.614 
## dis                 -0.394  -0.419  -0.718 
## rad                  0.614   0.675         
## tax                  0.712   0.697         
## ptratio              0.558          -0.107 
## b                   -0.417  -0.213         
## lstat                0.780           0.277 
## outlier             -0.273  -0.181         
## rooms_per_household  0.338   0.563   0.679 
## tax_per_room         0.755   0.600         
## 
##                Factor1 Factor2 Factor3
## SS loadings      6.454   2.659   1.955
## Proportion Var   0.359   0.148   0.109
## Cumulative Var   0.359   0.506   0.615
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 1624.97 on 102 degrees of freedom.
## The p-value is 1.5e-272
# Compare PCA vs FA loadings visually (Factor 1)
fa_loadings <- as.data.frame(x.f$loadings[, 1:3])
pca_loadings <- as.data.frame(pca$rotation[, 1:3])

plot(pca_loadings[,1], fa_loadings[,1],
     xlab = "PCA Loadings (PC1)", ylab = "Factor Loadings (F1)",
     main = "PCA vs. Factor Analysis Loadings (Factor 1)",
     pch = 19, col = "steelblue")
abline(a = 0, b = 1, col = "red", lty = 2)
text(pca_loadings[,1], fa_loadings[,1],
     labels = rownames(pca_loadings), pos = 3, cex = 0.7)

2-factor model (no rotation)

x.f2 <- factanal(new_data, factors = 2, rotation = "none", scores = "regression")
x.f2
## 
## Call:
## factanal(x = new_data, factors = 2, scores = "regression", rotation = "none")
## 
## Uniquenesses:
##                 lat                medv               cmedv                crim 
##               0.919               0.005               0.005               0.663 
##               indus                chas                 nox                  rm 
##               0.277               0.943               0.214               0.513 
##                 age                 dis                 rad                 tax 
##               0.380               0.249               0.459               0.351 
##             ptratio                   b               lstat             outlier 
##               0.725               0.793               0.317               0.912 
## rooms_per_household        tax_per_room 
##               0.159               0.338 
## 
## Loadings:
##                     Factor1 Factor2
## lat                  0.271         
## medv                -0.998         
## cmedv               -0.998         
## crim                 0.400   0.421 
## indus                0.501   0.687 
## chas                -0.172   0.165 
## nox                  0.446   0.766 
## rm                  -0.698         
## age                  0.393   0.683 
## dis                 -0.268  -0.824 
## rad                  0.400   0.618 
## tax                  0.487   0.642 
## ptratio              0.511   0.120 
## b                   -0.342  -0.300 
## lstat                0.749   0.350 
## outlier             -0.211  -0.207 
## rooms_per_household  0.175   0.900 
## tax_per_room         0.556   0.594 
## 
##                Factor1 Factor2
## SS loadings      5.138   4.643
## Proportion Var   0.285   0.258
## Cumulative Var   0.285   0.543
## 
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 3301.36 on 118 degrees of freedom.
## The p-value is 0
# Compare PCA vs FA loadings visually (Factor 2)
plot(pca_loadings[,2], x.f2$loadings[,2],
     xlab = "PCA Loadings (PC2)", ylab = "Factor Loadings (F2)",
     main = "PCA vs. Factor Analysis Loadings (Factor 2)",
     pch = 19, col = "darkgreen")
abline(a = 0, b = 1, col = "red", lty = 2)
text(pca_loadings[,2], x.f2$loadings[,2],
     labels = rownames(pca_loadings), pos = 3, cex = 0.7)

3-factor model with varimax rotation

x.f_varimax <- factanal(new_data, factors = 3, rotation = "varimax", scores = "regression")
x.f_varimax
## 
## Call:
## factanal(x = new_data, factors = 3, scores = "regression", rotation = "varimax")
## 
## Uniquenesses:
##                 lat                medv               cmedv                crim 
##               0.893               0.005               0.005               0.626 
##               indus                chas                 nox                  rm 
##               0.279               0.934               0.215               0.511 
##                 age                 dis                 rad                 tax 
##               0.312               0.154               0.165               0.005 
##             ptratio                   b               lstat             outlier 
##               0.670               0.777               0.316               0.893 
## rooms_per_household        tax_per_room 
##               0.108               0.069 
## 
## Loadings:
##                     Factor1 Factor2 Factor3
## lat                          0.184  -0.270 
## medv                -0.300           0.948 
## cmedv               -0.304           0.947 
## crim                 0.505   0.261  -0.225 
## indus                0.542   0.589  -0.285 
## chas                         0.170   0.184 
## nox                  0.459   0.719  -0.239 
## rm                  -0.180           0.673 
## age                  0.281   0.742  -0.241 
## dis                 -0.305  -0.864         
## rad                  0.874   0.246  -0.103 
## tax                  0.945   0.269  -0.170 
## ptratio              0.410          -0.402 
## b                   -0.375  -0.189   0.215 
## lstat                0.351   0.403  -0.632 
## outlier             -0.291           0.116 
## rooms_per_household  0.397   0.856         
## tax_per_room         0.890   0.265  -0.262 
## 
##                Factor1 Factor2 Factor3
## SS loadings      4.273   3.454   3.340
## Proportion Var   0.237   0.192   0.186
## Cumulative Var   0.237   0.429   0.615
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 1624.97 on 102 degrees of freedom.
## The p-value is 1.5e-272
# Visualize rotated factor loadings
barplot(x.f_varimax$loadings[,1],
        main = "Varimax Rotated Factor 1 Loadings",
        col = "steelblue", las = 2)

barplot(x.f_varimax$loadings[,2],
        main = "Varimax Rotated Factor 2 Loadings",
        col = "darkgreen", las = 2)

barplot(x.f_varimax$loadings[,3],
        main = "Varimax Rotated Factor 3 Loadings",
        col = "purple", las = 2)

The factor analysis explains a similar proportion of variance as the principal component analysis (PCA). For example, the first three principal components account for about 66% of the variance, while the three extracted factors explain roughly 62%.

When comparing the factor loadings to the PCA loadings, the overall patterns are very consistent. The interpretation of Factor 1 is identical to PC1, showing the same variable relationships. For Factor 2, some loadings have opposite signs, but since the direction (positive vs. negative) in PCA and FA is arbitrary, the underlying relationships remain the same — meaning the interpretation aligns closely with PCA Dimension 2.
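Because the sign of any component or factor is arbitrary, it can help to align signs before overlaying loadings. A small sketch of the idea on synthetic loading vectors; `align_sign()` is a hypothetical helper, not part of the analysis above:

```r
# Hedged sketch: flip a factor's loadings so it correlates positively
# with the matching PCA component before comparison
align_sign <- function(fa_col, pca_col) {
  if (cor(fa_col, pca_col) < 0) -fa_col else fa_col
}

pca_col <- c(0.6, -0.5, 0.4, -0.2)      # synthetic PC loadings
fa_col  <- c(-0.55, 0.45, -0.35, 0.25)  # same pattern, opposite sign

aligned <- align_sign(fa_col, pca_col)
cor(aligned, pca_col)  # positive after flipping: patterns agree up to sign
```

Applying this to Factor 2 before the scatterplot above would put its points along the positive diagonal rather than the negative one.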

Clustering tools

Below we apply two different types of clustering methods: k-means and k-medoids.

Apply k-means for different values of k

# K-means
library(MASS)
library(dplyr)
library(tidyr)
library(ggplot2)
library(factoextra)
library(cluster)

set.seed(123)

# Data prep
x <- BostonHousing2[, !(names(BostonHousing2) %in%
  c("townno", "tract", "medv", "chas", "rad", "outlier", "town", "cmedv"))]

x_scaled <- scale(x)
k <- 5

# K-means clustering
fit.kmeans <- kmeans(x_scaled, centers = k, nstart = 1000)
groups <- fit.kmeans$cluster

# Barplot of cluster sizes
barplot(table(groups), col = "steelblue",
        main = "Cluster Sizes (K-means)",
        xlab = "Cluster", ylab = "Count")

# Cluster centers
centers <- fit.kmeans$centers
tidy <- as.data.frame(t(centers)) %>%
  gather(cluster, coor) %>%
  mutate(var = rep(colnames(centers), times = k),
         size = rep(table(groups), each = ncol(centers)))

# Visualizations
# Variable-centered view
tidy %>%
  ggplot(aes(x = cluster, y = coor, fill = cluster)) +
  geom_col() +
  facet_wrap(~var, scales = "free_y") +
  geom_text(aes(label = size),
            position = position_stack(1.2)) +
  theme_minimal() +
  labs(title = "Cluster Centers by Variable (K-means)")

# Cluster-centered view
tidy %>%
  ggplot(aes(x = var, y = coor, fill = var)) +
  geom_col() +
  facet_wrap(~cluster, scales = "free_y") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Variable Profiles per Cluster (K-means)") +
  theme_minimal()

# 2D Cluster Visualization
fviz_cluster(fit.kmeans, data = x_scaled, geom = "point",
             ellipse.type = "norm", main = "K-means Clusters") +
  theme_minimal()

K-medoids (PAM)

# K-medoids PAM clustering
library(factoextra)
library(cluster)
set.seed(123)

# PAM clustering
k <- 5
fit.pam <- eclust(x_scaled, FUNcluster = "pam", stand = TRUE,
                  k = k, graph = FALSE, nstart = 1000)

# Barplot of cluster sizes
barplot(table(fit.pam$cluster), col = "darkred",
        main = "Cluster Sizes (K-medoids / PAM)",
        xlab = "Cluster", ylab = "Count")

# Visualizing clusters
fviz_cluster(fit.pam, geom = "point", ellipse.type = "norm",
             main = "Clusters with PAM (K-medoids)",
             star.plot = TRUE, repel = TRUE) +
  theme_minimal()

Comparison of K-means and K-medoids: The K-means results show five main clusters that are fairly well separated in the PCA plot (about 45% of the variance on Dim1 and 10% on Dim2). Most clusters look compact and round, with red and green standing out the most. The yellow and purple ones overlap a bit in the lower-left area, probably meaning those neighborhoods share similar traits like mid-level crime or room counts but differ in things like taxes or income levels.

The PAM clustering looks very similar overall but forms tighter, cleaner groups. Each cluster is centered on an actual data point (the medoid), and the “stars” show which points belong to it. The borders are sharper, and a few overlapping points—especially between yellow and purple—get reassigned for a better fit.

In short, both methods found the same general neighborhood patterns. K-means gives slightly finer detail but is touchier about outliers, while PAM creates clearer, more stable clusters.
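One way to quantify "clearer, more stable clusters" is to compare average silhouette widths for the two methods. A sketch on the built-in `USArrests` data as a stand-in (it only needs a scaled numeric matrix), using the `cluster` package already loaded above:

```r
library(cluster)  # for pam() and silhouette()

set.seed(123)
xs <- scale(USArrests)
d  <- dist(xs)

km <- kmeans(xs, centers = 4, nstart = 100)
pm <- pam(xs, k = 4)

# Mean silhouette width: closer to 1 means tighter, better-separated clusters
sil_km  <- mean(silhouette(km$cluster, d)[, "sil_width"])
sil_pam <- mean(silhouette(pm$clustering, d)[, "sil_width"])

round(c(kmeans = sil_km, pam = sil_pam), 3)
```

Running the same comparison on `x_scaled` with k = 5 would give a numeric check on which method separates the Boston neighborhoods more cleanly.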